[KYUUBI #6830] Allow indicate advisory shuffle partition size when merge small files #6831
Conversation
the use case is already covered by
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff            @@
##          master    #6831    +/-   ##
========================================
  Coverage    0.00%    0.00%
========================================
  Files         687      687
  Lines       42442    42439       -3
  Branches     5793     5792       -1
========================================
+ Misses      42442    42439       -3

☔ View full report in Codecov by Sentry.
@pan3793 Yes, I hadn't noticed before.
@yabola The real written size is affected by several things, e.g. the input data (which might come from shuffle or directly from other data sources), its compression codec and level, the data itself, the written file format as you mentioned, and the write compression codec and level. I don't think we can estimate that automatically and correctly.
@pan3793 Hmm, but in the scenario of merging small files, we only need to consider the shuffle data size (this rule only goes from shuffle data to files, so it doesn't matter what the data source is).
I overlooked that, you are right. I read Iceberg's code and understand how it works, but I am a bit pessimistic about adopting it, because the real compression ratio depends on the data itself. The experience-based assumption of a compression ratio is not always true, and when the estimation deviates significantly from the real value, it's hard to explain to users how that happened.

Files written by a Spark job are likely read by other Spark jobs, and the data will be converted to Spark's InternalRow layout (the same as shuffle) again. Has the compression ratio been considered on the read code path too?

Instead of setting the …

DISCLAIMER: I'm not fully against adding the proposed feature if other people think it's a good idea, especially if they can provide some cases with actual benefits, as long as the feature is disabled by default.
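For concreteness, a minimal sketch of the Iceberg-style estimate being debated here. The 4x ratio and the variable names are illustrative assumptions, not values or APIs from this PR or from Iceberg:

// Sketch: pick the shuffle (advisory) partition size so that, after columnar
// compression, each written file lands near the desired size.
// The compression ratio is an experience-based guess, nothing measures it for you.
val targetFileSizeBytes     = 128L * 1024 * 1024 // desired size of each output file
val assumedCompressionRatio = 4.0                // assumed shuffle bytes / on-disk bytes
val advisoryShuffleSizeBytes =
  (targetFileSizeBytes * assumedCompressionRatio).toLong
// With a 4x guess, asking for ~512MB shuffle partitions hopes to yield ~128MB files;
// if the real ratio is 8x, the files come out around 64MB, which is the
// hard-to-explain drift described above.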
@pan3793 I will close this; if anyone finds it helpful, they can revisit it ~
Why are the changes needed?
When merging small files (with spark.sql.optimizer.insertRepartitionBeforeWrite.enabled=true), the default session advisory partition size (64MB) is used as the target. This default can still lead to small files, because columnar file formats compress the written data well (usually to 1/4 or less of the shuffle exchange size), so the resulting files are often only around 15MB. Spark now supports configuring the advisory size on the rebalance expression (apache/spark#40421), so we can add a configuration that sets the merge target size separately.
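A hedged sketch of how the proposed knob might be used. The repartition-before-write switch and the session-wide advisory size below are existing Kyuubi/Spark configs quoted in this PR; the last key is a placeholder name for the separate merge target introduced here, not the actual config name:

// Scala sketch, assuming a running SparkSession with the Kyuubi extension loaded.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Existing behavior: the session-wide 64MB advisory size also drives the rebalance
// inserted before writes, which often produces ~15MB columnar files after compression.
spark.conf.set("spark.sql.optimizer.insertRepartitionBeforeWrite.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")

// Proposed: a separate, larger target used only for the merge-small-files rebalance,
// so post-compression files land closer to the desired size.
// NOTE: placeholder key; the real name is whatever this PR defines.
spark.conf.set("spark.sql.optimizer.insertRepartitionBeforeWrite.advisorySize", "256MB")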
Was this patch authored or co-authored using generative AI tooling?
No.